Equivalence between Priority Queues and Sorting in External Memory



Abstract

A priority queue is a fundamental data structure that maintains a dynamic ordered set
of keys and supports the following basic operations: insertion of a key, deletion of a key, and
finding the smallest key. The complexity of the priority queue is closely related to that of
sorting: a priority queue can be used to implement a sorting algorithm trivially. Thorup [10]
proved that the converse is also true in the RAM model. In particular, he designed a priority
queue that uses the sorting algorithm as a black box, such that the per-operation cost of the
priority queue is asymptotically the same as the per-key cost of sorting. In this paper, we
prove an analogous result in the external memory model, showing that priority queues are
computationally equivalent to sorting in external memory, under some mild assumptions. The
reduction makes it possible to prove lower bounds for external sorting by proving a lower
bound for priority queues.



1 Introduction 



The priority queue is an abstract data structure of fundamental importance. A priority queue
maintains a set of keys and supports the following operations: insertion of a key, deletion of a key,
and findmin, which returns the current minimum key in the priority queue. It is well known that a
priority queue can be used to implement a sorting algorithm: we simply insert all keys to be sorted
into the priority queue, and then repeatedly delete the minimum key to extract the keys in sorted
order. Thorup [10] showed that the converse is also true in the RAM model. In particular, he
showed that given a sorting algorithm that sorts $N$ keys in $N S(N)$ time, there is a priority queue
that uses the sorting algorithm as a black box, and supports insertion and deletion in $O(S(N))$
time, and findmin in constant time. The reduction uses linear space. The main implication of
this reduction is that we can regard the complexity of internal priority queues as settled, and just
focus on establishing the complexity of sorting. Algorithmically, it also gives new priority queue
constructions by using the fastest (integer) sorting algorithms currently known: an $O(N \log\log N)$
deterministic algorithm by Han [6] and an $O(N\sqrt{\log\log N})$ randomized one by Han and Thorup [7].
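The easy direction of the equivalence (a priority queue yields a sorting algorithm) can be sketched in a few lines. This is only an illustration: Python's `heapq` binary heap stands in for an arbitrary priority queue, and the function name `pq_sort` is hypothetical.

```python
import heapq

def pq_sort(keys):
    """Sort by inserting every key into a priority queue and then
    repeatedly deleting the minimum (heapq is a stand-in queue)."""
    heap = []
    for k in keys:
        heapq.heappush(heap, k)          # insertion
    out = []
    while heap:
        out.append(heapq.heappop(heap))  # findmin followed by deletion
    return out
```

With $N$ insertions and $N$ deletions, the total cost of the sort is $2N$ times the per-operation cost of whichever queue is plugged in.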

In this paper, we prove an analogous result in the external memory model (the I/O model),
showing that priority queues are almost computationally equivalent to sorting in external memory.
We design a priority queue that uses the sorting algorithm as a black box, such that the update cost
of the priority queue is essentially the same as the per-key I/O cost of the sorting algorithm. The
priority queue always has the current minimum key in memory, so findmin can be handled without
I/O cost. Our priority queue is a non-trivial generalization of Thorup's, which is fundamentally an
internal structure. The main reasons why Thorup's structure does not work in the I/O model are
that it cannot flush the buffers I/O-efficiently, and that it does not specify any order for performing
the flush and rebalance operations. Moreover, deletions are supported in a very different way in
the I/O model; we have to handle them lazily in order to achieve I/O-efficiency.

1.1 Our results 

Let us first recall the standard I/O model [1]: The machine consists of an internal memory of 
size M and an infinitely large external memory. Computation can only be carried out in internal 
memory. The external memory is divided into blocks of size B, and data is moved between internal 
and external memory in terms of blocks. We measure the complexity of an algorithm by counting 
the number of I/Os it performs, while internal memory computation is free. 
Our main result is stated in the following theorem: 

Theorem 1.1. Suppose we can sort up to $N$ keys in $N S(N)/B$ I/Os in external memory, where $S$
is a non-decreasing function. Then there exists an external priority queue that uses linear space and
supports a sequence of $N$ insertion and deletion operations in
$O\left(\frac{1}{B}\sum_{i \ge 0} S\left(B\log^{(i)}\frac{N}{B}\right)\right)$ amortized
I/Os per operation. Findmin can be supported without I/O cost. The reduction uses $O(B)$ internal
memory and is deterministic.

The first implication of Theorem 1.1 is that if the main memory has size $\Omega(B\log^{(c)}\frac{N}{B})$ for any
constant $c$, then our priority queue supports insertion and deletion with $O(S(N)/B)$ amortized
I/O cost. This is because $S(N) = 0$ when $N \le M$, since internal memory computation is free. Even if
$M = O(B)$, the reduction is still tight as long as the function $S$ does not grow too slowly. More
precisely, we have the following corollary:

Corollary 1.1. For $S(N) = \Omega(2^{\log^* \frac{N}{B}})$, the priority queue supports updates with $O(S(N)/B)$
amortized I/O cost; for $S(N) = o(2^{\log^* \frac{N}{B}})$, the priority queue supports updates with
$O(S(N)\log^* \frac{N}{B} / B)$ amortized I/O cost.






The first part can be verified by plugging $S(N) = 2^{\log^* \frac{N}{B}}$ into Theorem 1.1 and showing that
the $S(B\log^{(i)}\frac{N}{B})$'s decrease exponentially with $i$. For the second part, we simply relax all the
$S(B\log^{(i)}\frac{N}{B})$'s to $S(N)$. Note that $2^{\log^* \frac{N}{B}} = o(\log^{(c)}\frac{N}{B})$ for any constant $c$, so it is very unlikely
that a sorting algorithm could achieve $S(N) = o(2^{\log^* \frac{N}{B}})$. No such algorithm is known, even in the
RAM model. Therefore, we can essentially consider our reduction to be tight.
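The first part of the corollary can be spelled out as a short calculation. The sketch below assumes the shorthand $S(X) = 2^{\log^*(X/B)}$ and the identity $\log^*(\log^{(i)} x) = \log^* x - i$ for $i \le \log^* x$; both are assumptions made here for illustration and are not stated in this form in the text.

```latex
\begin{align*}
S\Bigl(B\log^{(i)}\tfrac{N}{B}\Bigr)
  &= 2^{\log^*\bigl(\log^{(i)}\frac{N}{B}\bigr)}
   = 2^{\log^*\frac{N}{B}-i}
   = 2^{-i}\,S(N),\\
\frac{1}{B}\sum_{i\ge 0} S\Bigl(B\log^{(i)}\tfrac{N}{B}\Bigr)
  &\le \frac{S(N)}{B}\sum_{i\ge 0}2^{-i}
   \le \frac{2\,S(N)}{B}.
\end{align*}
```

The geometric sum is what gives the $O(S(N)/B)$ bound for this particular choice of $S$.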

1.2 Related work 

Sorting and priority queues have been well studied in the comparison-based I/O model, in which the
keys can only be accessed via comparisons. Aggarwal and Vitter [1] showed that
$\Theta(\frac{N}{B}\log_{M/B}\frac{N}{B})$ I/Os are sufficient and necessary to sort $N$ keys in the
comparison-based I/O model. This bound is often referred to as the sorting bound. If the comparison
constraint is replaced by the weaker indivisibility constraint, there is an
$\Omega(\min\{\frac{N}{B}\log_{M/B}\frac{N}{B}, N\})$ lower bound, known as the permuting
bound. The two bounds are the same when $\frac{N}{B}\log_{M/B}\frac{N}{B} \le N$; it is conjectured that for this
parameter range, $\Omega(\frac{N}{B}\log_{M/B}\frac{N}{B})$ is still the sorting lower bound even without the indivisibility
constraint. For $\frac{N}{B}\log_{M/B}\frac{N}{B} > N$, the current situation in the I/O model is the same as that in
the RAM model, that is, the best upper bound is just to use the best RAM algorithm (which
has $O(N\log\log N)$ time deterministically or $O(N\sqrt{\log\log N})$ time randomized) naively in external
memory, ignoring the blocking entirely, and there is no non-trivial lower bound. When the block
size is not too small, none of the RAM sorting algorithms works better than the comparison-based
one, which makes the situation "cleaner". Thus, a sorting lower bound (without any restrictions)
has been considered to be more hopeful in the I/O model (with $B$ not too small) than in the
RAM model, and it was posed as a major open problem in [1]. Our result therefore provides a
way to approach a sorting lower bound via that of priority queues; data structure lower
bounds have been considered (relatively) easier to obtain than (concrete) algorithm lower bounds
(except in restricted computation models), as witnessed by the many recent strong cell probe lower
bounds for data structures, such as [8, 9] among many others. However, our result does not offer
any new bounds for priority queues because we do not know of a better sorting algorithm than
the comparison-based ones in the I/O model (and the conjecture is that they do not exist when
$\frac{N}{B}\log_{M/B}\frac{N}{B} \le N$).

Since a priority queue can be used to sort $N$ keys with $N$ insertion and $N$ deletemin operations,
it follows that $\Omega(\frac{1}{B}\log_{M/B}\frac{N}{B})$ is also a lower bound for the amortized I/O cost per operation for
any external priority queue, in the comparison-based I/O model. There are many priority queue
constructions that achieve this lower bound, such as the buffer tree [2], $M/B$-ary heaps [5], and
array heaps [4]. See the survey [11] for more details. However, they do not use sorting as just a
black box, and cannot be improved even if we have a faster external sorting algorithm. Thus they
do not give a priority queue-to-sorting reduction. The extra $O(\log_{M/B}\frac{N}{B})$ factor comes from a tree
structure with fanout $O(M/B)$ within the priority queue construction, and a key must be moved
$\Omega(\log_{M/B}\frac{N}{B})$ times to "bubble up" or "bubble down".

Arge et al. [3] developed a cache-oblivious priority queue that achieves the sorting bound with
the tall cache assumption, that is, $M$ is assumed to be of size at least $B^2$. We note that their
structure can serve as a priority queue-to-sorting reduction in the I/O model, by replacing the
cache-oblivious sort with a sorting black box. The resulting priority queue supports all operations
in $\frac{1}{B}\sum_{i \ge 0} S(N^{(2/3)^i})$ amortized I/Os if the sorting algorithm sorts $N$ keys in $N S(N)/B$ I/Os.
However, this reduction is not tight for $S(N) = O(\log\log\frac{N}{B})$, and there seems to be no easy way
to get rid of the tall cache assumption, even if the algorithm has the knowledge of $M$ and $B$.

2 Structure



In this section, we describe the structure of our priority queue. In the next section, we show how
this structure supports various operations. Finally, we analyze the I/O costs of these operations.

The priority queue consists of multiple layers whose sizes vary from $N$ to $cB$, where $c$ is some
constant to be determined later. The $i$'th layer from above has size $\Theta(B\log^{(i)}\frac{N}{B})$, for $i \ge 0$, and
the priority queue has $O(\log^* N)$ layers. For the sake of simplicity we will refer to a layer by its
size. Thus the layers from the largest to the smallest are layer $N$, layer $B\log\frac{N}{B}$, ..., layer $cB$.
Layer $cB$ is also called the head, and is stored in main memory. Given a layer $X$, its upper layer and
lower layer are layer $B2^{X/B}$ and layer $B\log\frac{X}{B}$, respectively. We use $\Psi_X$ to denote $B2^{X/B}$ and $\Phi_X$ to
denote $B\log\frac{X}{B}$. The priority queue maintains the invariant that the keys in layer $\Phi_X$ are smaller
than the keys in layer $X$. In particular, the minimum key is always stored in the head and can be
accessed without I/O cost.
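To see how quickly the layer sizes shrink, the sequence $N, B\log\frac{N}{B}, B\log\log\frac{N}{B}, \ldots, cB$ can be computed directly. The sketch below assumes base-2 logarithms, a default $c = 16$, and $N > cB$; the function name `layer_sizes` is hypothetical.

```python
import math

def layer_sizes(n, b, c=16):
    """Layer sizes from the top layer N down to the head c*B.
    Base-2 logs and the value of c are assumptions of this sketch."""
    sizes = [float(n)]
    x = math.log2(n / b)        # log^{(1)}(N/B)
    while x > c:                # stop once B*log^{(i)}(N/B) would drop to c*B
        sizes.append(b * x)
        x = math.log2(x)        # next iterated logarithm
    sizes.append(float(c * b))  # the head
    return sizes
```

For example, with $N = 2^{40}$ and $B = 1024$ only one intermediate layer of size $30B$ fits between layer $N$ and the head, which illustrates the $O(\log^* N)$ bound on the number of layers.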

We maintain a main memory buffer of size $O(B)$ to accommodate incoming insertion and deletion
operations. In order to distribute keys in the memory buffer to different layers I/O-efficiently, we
maintain a structure called the layer navigation list. Since this structure will also be used in other
components of the priority queue, we define it in a unified way. Suppose we want to distribute
the keys in a buffer $\mathcal{B}$ to $t$ sub-structures $S_1, S_2, \ldots, S_t$. The keys in different sub-structures are
sorted relative to each other, that is, the keys in $S_i$ are less than or equal to the keys in $S_{i+1}$. Each
sub-structure $S_i$ is associated with a buffer $\mathcal{B}_i$, which accommodates keys transferred from $\mathcal{B}$. The
goal is to distribute the keys in $\mathcal{B}$ to each $\mathcal{B}_i$ I/O-efficiently, such that the keys that go to $\mathcal{B}_i$ have
values between the minimum keys of $S_i$ and $S_{i+1}$. A navigation list stores a set of $t$ representatives,
each representing a sub-structure. The representative of $S_i$, denoted $r_i$, is a triple that stores the
minimum key of $S_i$, the number of keys stored in $\mathcal{B}_i$, and a pointer to the last non-full block of the
buffer $\mathcal{B}_i$. The representatives are stored consecutively on the disk, and are sorted on the minimum
keys. The layer navigation list is built for the $O(\log^* N)$ layers, so it has size $O(\log^* N)$. Please see
Figure 1.
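A navigation list can be pictured as a sorted array of triples. The following in-memory sketch uses hypothetical names (`Representative`, `target_index`), an integer standing in for the disk pointer, and the assumption that keys smaller than every stored minimum are routed to the first sub-structure.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class Representative:
    min_key: int          # minimum key of sub-structure S_i
    buffered: int = 0     # number of keys currently in its buffer
    last_block: int = -1  # hypothetical address of the last non-full block

def target_index(nav_list, key):
    """Index i such that min(S_i) <= key < min(S_{i+1}).
    Keys below every minimum go to index 0 (an assumption)."""
    mins = [r.min_key for r in nav_list]
    return max(bisect_right(mins, key) - 1, 0)
```

In the structure itself the list is scanned sequentially during a flush rather than searched key by key; the binary search here only illustrates which buffer a key belongs to.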

Now we describe the structures inside a layer $X$ other than layer $cB$, which is always in
main memory. First we maintain a layer buffer of size $\Phi_X/2$ to store keys flushed from the memory
buffer. The main structure of layer $X$ consists of $O(\log\frac{X}{\Phi_X})$ levels with exponentially increasing
sizes. The $j$'th level from the bottom, denoted level $j$, has size $\Theta(8^j\Phi_X)$. We also keep the invariant
that the keys in level $j$ are less than or equal to the keys in level $j + 1$. We maintain a level navigation
list of size $\Theta(\log\frac{X}{\Phi_X})$, which represents the $\log\frac{X}{\Phi_X}$ levels. Most keys in level $j$ are stored in $\Theta(8^j)$
disjoint base sets, each of size $\Theta(\Phi_X)$. The base sets, from left to right, are sorted relative to each
other, but they are not internally sorted. Other than the base sets, there is a level buffer of size
$8^jB$, which is used to temporarily accommodate keys before distributing them to the base sets. We
also maintain a base navigation list of size $\Theta(8^j)$ for the base sets. Note that we do not impose the
level structures on layer $cB$ since it can fit in main memory. The components of the priority
queue are illustrated in Figure 1.

Here we provide some intuition for this complicated structure. We first divide the keys into
exponentially increasing levels, which in some sense is similar to building a heap. However, the
$O(\log N)$-level structure implies that a key may be moved $O(\log N)$ times in its lifetime. To
overcome this, we group the keys into base sets of logarithmic sizes so that we can move more keys
with the same I/O cost (we move pointers to the base sets rather than the base sets themselves).
Finally, when a base set gets down to level 0, we need to recursively build the structure on it, which
results in the $O(\log^* N)$ layers.

Let $l_X$ denote the top level of layer $X$. We use $\mathcal{B}_X$ to denote the layer buffer of layer $X$ and $\mathcal{B}_j$
to denote the level buffer of level $j$ when the layer is specified. Our priority queue maintains the
following invariants for layer $X$:

Invariant 1. The layer buffer $\mathcal{B}_X$ contains at most $\frac{1}{2}\Phi_X$ keys; the level buffer $\mathcal{B}_j$ at layer $X$
contains at most $8^jB$ keys.

Invariant 2. The layer buffer $\mathcal{B}_X$ only contains keys between the minimum keys of layer $X$ and
its upper layer. The level buffer $\mathcal{B}_j$ only contains keys between the minimum keys of level $j$ and its
upper level.

Invariant 3. A base set in layer $X$ has size between $\frac{1}{2}\Phi_X$ and $2\Phi_X$; level $j$ of layer $X$, for
$j = 0, 1, \ldots, l_X - 1$, has size between $2 \cdot 8^j\Phi_X$ and $6 \cdot 8^j\Phi_X$; and level $l_X$ has size between $2 \cdot 8^{l_X}\Phi_X$
and $40 \cdot 8^{l_X}\Phi_X$.

Invariant 4. The head contains at most $2cB$ keys.

Note that when we talk about the size of a level, we only count the keys in its base sets and
exclude the level buffer. The top level has a slightly different size range so that the construction
works for any value of $X$.

We say a layer buffer, a level buffer, a base set, a level or the head overflows if its size exceeds
its upper bound in Invariant 1, 3 or 4; we say a base set or a level underflows if its size falls below
the lower bound in Invariant 3.
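The overflow and underflow conditions for a level follow mechanically from Invariant 3. A minimal sketch, assuming the level size (base sets only, excluding the level buffer) is already known; the function name `level_status` and its string return values are hypothetical.

```python
def level_status(size, j, top_level, phi):
    """Classify level j of a layer with parameter phi (= Phi_X)
    against the size range of Invariant 3."""
    lower = 2 * 8 ** j * phi
    # the top level is allowed a wider range than the other levels
    upper = (40 if j == top_level else 6) * 8 ** j * phi
    if size > upper:
        return "overflow"
    if size < lower:
        return "underflow"
    return "balanced"
```

With `phi = 10`, a level 0 holding 70 keys overflows when it is not the top level (the bound is 60) but is balanced when it is the top level (the bound is 400).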

3 Operations 

Recall that the priority queue supports three operations: insertion, deletion, and findmin. Since
we always maintain the minimum key in main memory (it is in either the head or the memory
buffer), a findmin operation incurs no I/O cost. We process deletions in a lazy fashion, that is, when
a deletion comes we generate a delete signal with the corresponding key and a time stamp, and
insert the delete signal into the priority queue. In most cases we treat the delete signals as normal
insertions. We only perform the actual deletion in the head so that the current minimum key is
always valid. To ensure linear space usage we perform a global rebuild after every $N/8$ updates.
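The lazy treatment of deletions can be illustrated on the head alone. The sketch below is an assumption-laden toy: entries are `(key, timestamp, kind)` tuples, a delete signal cancels one earlier insertion of the same key, and signals with no matching key in the head are returned as pending. The exact matching rule is not specified in the text, and the name `process_head` is hypothetical.

```python
def process_head(entries):
    """Apply delete signals inside the head.
    entries: iterable of (key, timestamp, kind), kind in {"ins", "del"}.
    Returns (surviving insertions sorted by key, unmatched delete signals)."""
    live = {}      # key -> timestamps of insertions not yet cancelled
    pending = []   # delete signals whose key is not present in the head
    for key, ts, kind in sorted(entries, key=lambda e: e[1]):  # time order
        if kind == "ins":
            live.setdefault(key, []).append(ts)
        elif live.get(key):
            live[key].pop()            # cancel one earlier insertion
        else:
            pending.append((key, ts, "del"))
    survivors = sorted((k, ts, "ins") for k, stamps in live.items() for ts in stamps)
    return survivors, pending
```

Because a delete signal travels through the structure like an ordinary key, it eventually reaches the same place as the key it refers to, which is why resolving signals only in the head suffices to keep the reported minimum valid.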

Our priority queue is implemented by three general operations: global rebuild, flush, and rebalance. 
A global rebuild operation sorts all keys and processes all delete signals to maintain linear size. 
A flush operation distributes all keys in a buffer to the buffers of corresponding sub-structures to 
maintain Invariant 1. A rebalance operation moves keys between two adjacent sub-structures to 
maintain Invariant 3. 



3.1 Global Rebuild

We conduct the first global rebuild when the internal memory buffer is full. Then, after each
global rebuild, we set $N$ to be the number of keys in the priority queue, and keep it fixed until the
next global rebuild. A global rebuild is triggered whenever layer $N$ (in fact, its top level) becomes
unbalanced or the priority queue has received $N/8$ new updates since the last global rebuild. We
show that it takes $O(N S(N)/B)$ I/Os to rebuild our priority queue. We first sort all keys in the
priority queue and process the delete signals. Then we scan through the remaining keys and divide
them into base sets of size $\Phi_N$, except the last base set, which may be smaller. This base set is
merged into its predecessor if its size is less than or equal to $\frac{1}{2}\Phi_N$. The first base set is used to construct
the lower layers, and the rest are used to construct layer $N$. To rebuild the $O(\log\frac{N}{\Phi_N})$ levels of
layer $N$, we scan through the base sets, and take the next $4 \cdot 8^j$ base sets to build level $j$, for
$j = 0, 1, 2, \ldots$. Note that the base navigation list of these $4 \cdot 8^j$ base sets can be constructed when
we scan through the keys in the base sets. The level rebuild process stops when we encounter an
integer $l_N$ such that the number of remaining base sets is more than $4 \cdot 8^{l_N}$, but less than or equal to
$4 \cdot (8^{l_N} + 8^{l_N+1}) = 36 \cdot 8^{l_N}$. Then we take these base sets to form the top level of layer $N$. After
the global rebuild, level $j$ has size $4 \cdot 8^j\Phi_N$, and the top level $l_N$ has size between $(4 \cdot 8^{l_N} - \frac{1}{2})\Phi_N$ and
$(36 \cdot 8^{l_N} + \frac{1}{2})\Phi_N$. For $X = B\log\frac{N}{B}, B\log^{(2)}\frac{N}{B}, \ldots, cB$, layer $X$ is constructed recursively using
the same algorithm. All buffers are left empty.

Based on the global rebuild algorithm, the priority queue maintains the following invariant
between two global rebuilds:

Invariant 5. The top level $l_X$ in layer $X$ is determined by the maximum $l_X$ such that



The number of layers and the number of levels in each layer will not change between two global
rebuilds.

As a result of Invariant 5, we have the following lemma:

Lemma 3.1. Suppose the top level in layer $X$ is level $l_X$. Then $l_X$ is an integer that satisfies the
following inequality:

\[ 4 \cdot 8^{l_X}\Phi_X \le X \le 40 \cdot 8^{l_X}\Phi_X. \]
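The partitioning step of the global rebuild for one layer can be sketched in memory as follows. The sketch assumes the keys are already sorted and the delete signals processed, that there is at least one base set for the lower layers plus more than four for the layer itself, and that the merge threshold for the last base set is half the base set size; the function names are hypothetical.

```python
def build_base_sets(sorted_keys, phi):
    """Cut sorted keys into base sets of size phi; a short tail
    (at most phi/2 keys) is merged into its predecessor."""
    sets = [sorted_keys[i:i + phi] for i in range(0, len(sorted_keys), phi)]
    if len(sets) > 1 and len(sets[-1]) <= phi / 2:
        tail = sets.pop()
        sets[-1] = sets[-1] + tail
    return sets

def assign_levels(base_sets):
    """The first base set feeds the lower layers. Level j then takes the
    next 4 * 8**j base sets, until the remainder r satisfies
    4 * 8**l < r <= 36 * 8**l; that remainder forms the top level l."""
    lower, rest = base_sets[0], base_sets[1:]
    levels, j = [], 0
    while len(rest) > 36 * 8 ** j:
        take = 4 * 8 ** j
        levels.append(rest[:take])
        rest = rest[take:]
        j += 1
    levels.append(rest)  # top level
    return lower, levels
```

Whenever the loop takes $4 \cdot 8^j$ base sets, more than $32 \cdot 8^j = 4 \cdot 8^{j+1}$ remain, so the remainder at the stopping index always exceeds $4 \cdot 8^{l}$, as the description requires.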
3.2 Flush

We define the flush operation in a unified way. Suppose we have a buffer $\mathcal{B}$ and $k$ sub-structures
$S_1, S_2, \ldots, S_k$. Each $S_i$ is associated with a buffer $\mathcal{B}_i$, and a navigation list $L$ of size $k$ is maintained
for the $k$ sub-structures. To flush the buffer $\mathcal{B}$ we first sort the keys in it. Then we scan through
the navigation list, and for each representative $r_i$ in $L$, we read the last non-full block of $\mathcal{B}_i$ into
memory, and fill it with keys from $\mathcal{B}$. When the block is full, we write it back to disk, and allocate a
new block. We do so until we encounter a key that is larger than the key in $r_{i+1}$. Then we update
$r_i$, and advance to $r_{i+1}$. The I/O cost for a flush is the cost of sorting a buffer of size $|\mathcal{B}|$ plus one
I/O for each sub-structure, so we have the following lemma:

Lemma 3.2. The I/O cost for flushing keys in buffer $\mathcal{B}$ to $k$ sub-structures is bounded by
$O\left(\frac{|\mathcal{B}|\,S(|\mathcal{B}|)}{B} + k\right)$.
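The distribution step of a flush is a single merge-like scan of the sorted buffer against the navigation list. The sketch below works on in-memory lists and ignores blocking; Python's `sorted` stands in for the sorting black box, keys equal to a minimum key are sent to the later sub-structure (a tie-breaking assumption), and the function name `flush` is hypothetical.

```python
def flush(buffer_keys, min_keys):
    """Distribute buffer_keys to k sub-structure buffers.
    min_keys[i] is the minimum key of S_i, in ascending order."""
    out = [[] for _ in min_keys]
    i = 0
    for key in sorted(buffer_keys):  # stand-in for the sorting black box
        # advance to the next representative once the key reaches its minimum
        while i + 1 < len(min_keys) and key >= min_keys[i + 1]:
            i += 1
        out[i].append(key)
    return out
```

Both the sorted buffer and the navigation list are traversed once from left to right, which mirrors the structure of the bound in Lemma 3.2: one sort of the buffer plus one visit per sub-structure.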

There are three individual flush operations. A memory flush distributes keys in the internal
memory buffer to $O(\log^* N)$ layer buffers; a layer flush on layer $X$ distributes keys in the layer
buffer to $O(\log\frac{X}{\Phi_X})$ level buffers in the layer; and a level flush on level $j$ at layer $X$ distributes keys
in the level buffer to $\Theta(8^j)$ base sets in the level.



3.3 Rebalance 



Rebalancing the base sets. Base rebalance is performed only after a level flush, since this is the
only operation that causes a base set to be unbalanced. Consider a level flush in level $j$ of layer
$X$. Suppose the base set $A$ overflows after the flush. To rebalance $A$ we sort and scan through
the keys in it, and split it into base sets of size $\Phi_X$. If the last base set has fewer than $\frac{1}{2}\Phi_X$ keys
we merge it into its predecessor. Note that any base set coming out of a split has between $\frac{1}{2}\Phi_X$
and $\frac{3}{2}\Phi_X$ keys, so it takes at least $\frac{1}{2}\Phi_X$ new updates to any of them before it initiates a new split.
Note that after the split we should update the representatives in the base navigation list. This can
be done without additional I/Os beyond those of the level flush operation: we store all new representatives in a
temporary list and rebuild the navigation list after all overflowed base sets are rebalanced in level
$j$. A base set never underflows, so we do not need a join operation.

Rebalancing the levels. We define two level rebalance operations: level push and level pull.
Consider level $j$ at layer $X$. When the number of keys in level $j$ (except the top level) exceeds
$6 \cdot 8^j\Phi_X$, a level push operation is performed to move some of its base sets to the upper level.
More precisely, we scan through the navigation list of level $j$ to find the first representative $r_k$ such
that the number of keys before $r_k$ is larger than $4 \cdot 8^j\Phi_X$. Then we split the navigation list of level
$j$ around $r_k$ and attach the second half to the navigation list of level $j + 1$. Note that by moving
the representatives we also move their corresponding base sets to level $j + 1$. By Invariant 3, the
number of keys in a base set is at most $2\Phi_X$, so the new level $j$ has size between $4 \cdot 8^j\Phi_X$ and
$(4 \cdot 8^j + 2)\Phi_X$. Finally, to maintain Invariant 2 we sort the level buffer $\mathcal{B}_j$ and move keys larger than
$r_k$ to the level buffer $\mathcal{B}_{j+1}$.
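The cut point of a level push depends only on the sizes recorded for the base sets, not on the keys themselves. A minimal sketch, assuming the base set sizes of level $j$ are available as a list in navigation-list order; the function name `level_push_split` is hypothetical.

```python
def level_push_split(base_sizes, j, phi):
    """Split the base sets of level j at the first position k where the
    keys before it exceed 4 * 8**j * phi.
    Returns (kept in level j, moved to level j + 1)."""
    threshold = 4 * 8 ** j * phi
    total = 0                      # number of keys before position k
    for k, size in enumerate(base_sizes):
        if total > threshold:
            return base_sizes[:k], base_sizes[k:]
        total += size
    return base_sizes, []          # no split point found
```

For instance, with `phi = 10`, `j = 0` and five base sets of 15 keys each (75 keys, above the overflow bound of 60), the first three base sets stay in level 0 with 45 keys, which lies between $4\Phi_X = 40$ and $(4 + 2)\Phi_X = 60$.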

Conversely, if the number of keys in level $j$ falls below $2 \cdot 8^j\Phi_X$ (except the top level), a level
pull operation is performed. We cut a portion of the navigation list of level $j + 1$ and attach it
to the navigation list of level $j$, such that the number of keys in level $j$ becomes between
$(4 \cdot 8^j - 2)\Phi_X$ and $4 \cdot 8^j\Phi_X$. We also sort $\mathcal{B}_{j+1}$, the buffer of level $j + 1$, and move the corresponding keys
to the level buffer $\mathcal{B}_j$.

Observe that after a level push/pull, the number of keys in level $j$ is between $(4 \cdot 8^j - 2)\Phi_X$ and
$(4 \cdot 8^j + 2)\Phi_X$, so it takes at least $\Omega(8^j\Phi_X)$ new updates before the level needs to be rebalanced
again. The main reason that we adopt this level rebalance strategy is that it does not touch all
keys in the level; the rebalance only takes place on the base navigation lists and the keys in the
level buffers.

Rebalancing the layers. When the top level $l_X$ of layer $X$ becomes unbalanced, we can no
longer rebalance it using only the navigation list. Recall that its upper level is level 0 in layer $\Psi_X$.
For simplicity we will refer to the two levels as level $l_X$ and level 0, without specifying their
layers. We also define two operations for rebalancing a layer: layer push and layer pull. A layer
push is performed when the layer overflows, that is, the number of keys in level $l_X$ exceeds
$40 \cdot 8^{l_X}\Phi_X$. In this case we sort all keys in level $l_X$ and level 0 together, then use the first $4 \cdot 8^{l_X}\Phi_X$
keys to rebuild level $l_X$ and the rest to rebuild level 0. Recall that to rebuild a level we scan through
the keys and divide them into base sets of size $\Phi_X$, except the last one, which has size between $\frac{1}{2}\Phi_X$
and $\frac{3}{2}\Phi_X$, and then we scan through the keys again to build the base navigation list. Note that
the rebuild operation will change the minimum key in layer $\Psi_X$, so we update the layer navigation
list accordingly. Finally we sort the keys in the layer buffer $\mathcal{B}_X$ and the level buffer $\mathcal{B}_{l_X}$, and move
the keys larger than the new minimum key of layer $\Psi_X$ to the level buffer $\mathcal{B}_0$.

A layer pull operation is performed when the layer underflows, that is, there are fewer than $2 \cdot 8^{l_X}\Phi_X$
keys in level $l_X$. A layer pull proceeds in the same way as a layer push does, except for the last
step. Here we sort the layer buffer $\mathcal{B}_{\Psi_X}$ and the level buffer $\mathcal{B}_0$ and move the keys smaller than the
new minimum key to the level buffer $\mathcal{B}_{l_X}$. After a layer push or pull, the number of keys in level
$l_X$ is $4 \cdot 8^{l_X}\Phi_X$. By Lemma 3.1, we have $40 \cdot 8^{l_X}\Phi_X \ge X$, so it takes at least $2 \cdot 8^{l_X}\Phi_X = \Omega(X)$ new
updates to layer $X$ before we initiate a new push or pull.

Note that since we do not impose the level structure on the head layer $cB$, we need to design the
layer push and layer pull operations specifically for it. A layer push is performed when the number
of keys in the head exceeds $2cB$. We sort all keys in it and in level 0 of layer $\Psi_{cB}$, and use
the first $cB$ keys to rebuild the head and the rest to rebuild level 0. A layer pull is performed when
the head becomes empty. The operation proceeds in the same way as a layer push does, except
that after rebuilding both levels, we sort the layer buffer $\mathcal{B}_{\Psi_{cB}}$ and the level buffer $\mathcal{B}_0$ together, and
move the keys smaller than the new minimum key of layer $\Psi_{cB}$ to the head.

3.4 Scheduling Flush and Rebalance Operations 

In order to achieve the I/O bounds in Theorem 1.1, we need to schedule the operations delicately. 
Whenever the memory buffer overflows we start to update the priority queue. This process is 
divided into three stages: the flush stage, the push stage, and the pull stage. In the flush stage we 
flush all overflowed buffers and rebalance all unbalanced base sets; in the push stage we use push 
operations to rebalance all overflowed layers and levels. We treat delete signals as insertions in the 
flush stage and the push stage. In the pull stage we deal with delete signals and use pull operations 
to rebalance all underflowed layers and levels. 

In the flush stage, we initialize a queue $Q_o$ to keep track of all overflowed buffers and a doubly
linked list $L_o$ to keep track of all overflowed levels. The buffers are flushed in a BFS fashion. First
we flush the memory buffer into $O(\log^* N)$ layer buffers. After flushing the memory buffer, we
insert the representatives of the overflowed layer buffers into $Q_o$, from bottom to top. We also
check whether the head overflows after the memory flush. If so, we insert its representative at the
beginning of $L_o$. Then we start to flush the layer buffers in $Q_o$. Again, when flushing a layer buffer
we insert the representatives of the overflowed level buffers into $Q_o$ from bottom to top. After all
layer buffers are flushed, we begin to flush level buffers in $Q_o$. After each level flush, we rebalance
all unbalanced base sets in this level, and if the level overflows we add the representative of this
level to the end of $L_o$. Note that the representatives in $L_o$ are sorted on the minimum keys of the
levels.

After all overflowed level buffers are flushed, we enter the push stage and start to rebalance levels
in $L_o$ in a bottom-up fashion. In each step, we take out the first level in $L_o$ (which is also the
current lowest overflowed level) and rebalance it. Suppose this level is level $j$ of layer $X$. If it is not
the top level or the head layer we perform a level push; otherwise we perform a layer push. Then
we delete the representative of this level from $L_o$. A level push may cause the level buffer of level
$j + 1$ to overflow, in which case we flush it and rebalance the overflowed base sets. Then we
check whether level $j + 1$ overflows. If so, we insert the representative of level $j + 1$ at the head of
$L_o$ (unless it is already at the beginning of $L_o$) and perform a level push on level $j + 1$. Otherwise
we take out a new level in $L_o$ and continue the process. When the top level of layer $N$ becomes
unbalanced we simply perform a global rebuild.

After rebalancing all levels, we enter the pull stage and start to process the delete signals. This
is done as follows. We first process all delete signals in the head. If the head becomes empty
we perform a layer pull to get more keys into the head. This may cause higher levels or layers to
underflow, and we keep performing level pulls and layer pulls until all levels and layers are balanced.
Consider a level pull or layer pull on level $j$ of layer $X$. After the level pull or layer pull the level
buffer $\mathcal{B}_j$ may overflow. If so, we flush it and rebalance the base sets when necessary. Note that
this may cause the size of level $j$ to grow, but it will not overflow, as we will show later, so we
do not need push operations in the pull stage. After all levels and layers are balanced, we process
the delete signals in the head again. We repeat the pull process until there are no delete signals
left in the head and the head is non-empty.

3.5 Correctness 

It should be obvious that the flush and the push stage will always succeed. The following two 
lemmas guarantee that the pull stage will also succeed. 

Lemma 3.3. When we perform a level pull on level $j$, there are enough keys in level $j + 1$ to
rebalance level $j$; when we perform a layer pull on layer $X$, there are enough keys in level 0 of
layer $\Psi_X$ to rebalance level $l_X$.

Proof. Recall that a level pull on level $j$ transfers at most $4 \cdot 8^j\Phi_X$ keys from level $j + 1$ to level
$j$. Since we always perform pull operations in a bottom-up fashion in the pull stage, and all levels
and layers are balanced before the pull stage, it follows that level $j + 1$ is always balanced when
performing a pull operation on level $j$. This implies that level $j + 1$ has at least $2 \cdot 8^{j+1}\Phi_X$ keys
when performing a pull operation on level $j$, which is sufficient to supply the level pull operation.

For a layer pull on layer $X$ other than the head, recall that the operation transfers at most
$4 \cdot 8^{l_X}\Phi_X$ keys from level 0 of layer $\Psi_X$ to level $l_X$. By a similar argument we know level 0 is
balanced, so it has at least $2X$ keys. Following Lemma 3.1, we have $2X \ge 8 \cdot 8^{l_X}\Phi_X$, so it suffices
to supply the layer pull operation. The same argument also works for a layer pull on the head,
since it acquires at most $cB$ keys from the upper level, and the level contains at least $2cB$ keys. □

Lemma 3.4. A level or a layer never overflows in the pull stage.

Proof. Consider a level pull on level $j$ of layer $X$. Recall that since we move some keys from $\mathcal{B}_{j+1}$
to $\mathcal{B}_j$, it is possible that $\mathcal{B}_j$ overflows and we need to perform a level flush on level $j$. We claim
that after this level flush, level $j$ is still balanced. For a proof, observe that level $j$ has size between
$(4 \cdot 8^j - 2)\Phi_X$ and $4 \cdot 8^j\Phi_X$ after the level pull, so it takes at least $2 \cdot 8^jB\log\frac{X}{B} \ge 2 \cdot 8^jcB$ new
updates before level $j$ overflows. Since the level pull transfers at most $8^{j+1}B$ keys from $\mathcal{B}_{j+1}$, after
the level pull, $\mathcal{B}_j$ has at most $8^jB + 8^{j+1}B = 9 \cdot 8^jB$ keys. Setting $c \ge 5$ allows level $j$ to
be still balanced after the level flush. This proves that a level never overflows in the pull stage.

Now consider a layer pull on layer $X$ other than the head. Recall that we move some keys from
the layer buffer $\mathcal{B}_{\Psi_X}$ and the level buffer $\mathcal{B}_0$ to $\mathcal{B}_{l_X}$, so it is possible that $\mathcal{B}_{l_X}$ overflows and we need
to perform a level flush on the new level $l_X$. We claim that after this level flush, level $l_X$ is still
balanced. For a proof, observe that after the layer pull, it takes at least $36 \cdot 8^{l_X}B\log\frac{X}{B} = 36 \cdot 8^{l_X}\Phi_X$ new updates
before level $l_X$ overflows. Since the layer pull transfers at most $X/2$ keys from the layer buffer $\mathcal{B}_{\Psi_X}$
and at most $8B$ keys from the level buffer $\mathcal{B}_0$ to the level buffer of level $l_X$, the level flush operation
flushes at most $X/2 + 8B + 8^{l_X}B$ keys to level $l_X$. By Lemma 3.1 we have

\[
36 \cdot 8^{l_X}\Phi_X = 20 \cdot 8^{l_X}\Phi_X + 16 \cdot 8^{l_X}\Phi_X
\ge \frac{X}{2} + 16 \cdot 8^{l_X}B\log\frac{X}{B}
\ge \frac{X}{2} + 8B + 8^{l_X}B.
\]

So level $l_X$ is still balanced after the level flush. Finally, consider a layer pull on the head. Recall
that it takes at least $cB$ new updates to the head before it overflows. Since the head acquires at
most $cB/2$ keys from the layer buffer $\mathcal{B}_{\Psi_{cB}}$, and at most $8B$ keys from the level buffer $\mathcal{B}_0$, we can
set $c \ge 16$ such that $cB \ge cB/2 + 8B$, so the head will remain balanced after the layer pull. This
proves that a layer never overflows in the pull stage. □



4 Analysis of Amortized I/O Complexity 

We analyze the amortized I/O cost for each operation during A^/8 updates. We will show that 
the amortized I/O cost per update is bounded hy Z^i=o S{B log^'^^ ^)), and Theorem 1.1 will 
follow. 

Global rebuild. Recall that the I/O cost for a global rebuild is $O(N S(N)/B)$ I/Os. We claim
that during $N/8$ updates only a constant number of global rebuilds are needed, so the amortized
I/O cost per update is bounded by $O(S(N)/B)$. This can be verified by the fact that a global
rebuild can only be triggered by $N/8$ new updates or by level $l_N$ becoming unbalanced, and
after a global rebuild it takes $\Omega(N)$ updates before level $l_N$ becomes unbalanced again.
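The amortization behind this claim can be written out explicitly. In the sketch below, $k$ denotes the (constant) number of global rebuilds during the $N/8$ updates and $c_1$ the hidden constant in the rebuild cost; both symbols are introduced here only for illustration.

```latex
\[
\frac{k \cdot c_1 \, N S(N)/B}{N/8}
  \;=\; 8 k c_1 \cdot \frac{S(N)}{B}
  \;=\; O\!\left(\frac{S(N)}{B}\right)
  \quad \text{I/Os per update.}
\]
```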

Flush. We analyze the I/O cost of the three different flush operations. For a memory flush, we sort a set
of $B$ keys in memory and merge them with a navigation list of size $O(\log^* N)$. By Lemma 3.2,
the I/O cost is $O(\log^* N)$. Therefore we charge $O\left(\frac{\log^* N}{B}\right)$ I/Os for each of the updates in the
memory buffer. Now consider a layer flush at layer $X$. Let $|B_X|$ denote the number of updates
in the layer buffer. By Invariant 1 the layer flush operation is performed only if $|B_X| > B \log\frac{X}{B}$.
There are $O(\log\frac{X}{B})$ level buffers, so by Lemma 3.2, the I/O cost is

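\[
O\left(\frac{|B_X| \, S(|B_X|)}{B} + \log\frac{X}{B}\right)
= O\left(\frac{|B_X| \, S(|B_X|)}{B} + \frac{|B_X|}{B}\right)
= O\left(\frac{|B_X| \, S(|B_X|)}{B}\right)
= O\left(\frac{|B_X| \, S(N)}{B}\right).
\]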

Thus, we can charge $O(S(N)/B)$ I/Os for each of the $|B_X|$ updates in the layer buffer.
Next, consider a level flush at level $j$ in layer $X$. Let $|B_j|$ denote the number of updates in the
buffer when we perform the flush operation, and by Invariant 1 we have $|B_j| > 8^j B$. Recall that
the size of the navigation list is $O(8^j)$, so by Lemma 3.2, the I/O cost is

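\[
O\left(\frac{|B_j| \, S(|B_j|)}{B} + 8^j\right)
= O\left(\frac{|B_j| \, S(|B_j|)}{B} + \frac{|B_j|}{B}\right)
= O\left(\frac{|B_j| \, S(|B_j|)}{B}\right)
= O\left(\frac{|B_j| \, S(N)}{B}\right).
\]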



Therefore we can charge $O(S(N)/B)$ I/Os for each of the $|B_j|$ updates in the level buffer.
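The way the navigation list term is absorbed in the level flush bound can be spelled out step by step. The sketch below uses only Invariant 1 together with two assumptions that we make explicit here: $S(\cdot) \ge 1$ and $S$ is nondecreasing.

```latex
\begin{align*}
|B_j| > 8^j B
  &\;\Longrightarrow\; 8^j < \frac{|B_j|}{B}
     && \text{(Invariant 1)} \\
  &\;\Longrightarrow\; 8^j < \frac{|B_j| \, S(|B_j|)}{B}
     && \text{(assuming } S(\cdot) \ge 1\text{)} \\
  &\;\Longrightarrow\; \frac{|B_j| \, S(|B_j|)}{B} + 8^j
       \le \frac{2\,|B_j| \, S(N)}{B}
     && \text{(assuming } S \text{ nondecreasing, } |B_j| \le N\text{)}.
\end{align*}
```

Dividing the last expression by $|B_j|$ gives the per-update charge of $O(S(N)/B)$.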

Rebalancing the base sets. Consider a rebalance operation for a base set $A$ at layer $X$. When
$A$ overflows we sort and divide it into equal segments. So the I/O cost for a base set rebalance
can be bounded by the sorting time of $O(|A| + \Phi_X)$ updates. Note that there are at least
updates to $A$ since the last rebalance operation on it, and by Invariant 3 we have $|A| > 2\Phi_X$. Thus,
the amortized I/O cost per update is $O(S(N)/B)$.

Rebalancing the levels. We first consider a level push operation on level $j$ of layer $X$. The
operation cuts the base navigation list of level $j$, takes the first half to form a new level $j$, and
attaches the rest to level $j + 1$. The I/O cost for this cut-attach procedure is $\Theta(8^j/B + 1)$, since the
navigation list is sorted and stored consecutively on disk. Then the operation sorts and redistributes
the level buffer $B_j$. Recall that we always flush the level buffer before rebalancing the level, so we
have $|B_j| < 8^j B$ when the level push is performed. The I/O cost for sorting and redistributing
$B_j$ is bounded by $O(8^j B \cdot S(8^j B)/B) = O(8^j S(X))$. Note that after a level push, it takes at
least $\Omega(8^j \Phi_X)$ new updates to level $j$ before it overflows again. So during $N/8$ updates at most
$O(N/(8^j \Phi_X))$ level push operations are performed on level $j$. It follows that the I/O cost of all
level push operations on level $j$ is bounded by

\[
O\left(\frac{N \, S(X)}{\Phi_X}\right) = O\left(\frac{N \, S(X)}{B \log\frac{X}{B}}\right).
\]

We charge $O\left(S(X)/(B \log\frac{X}{B})\right)$ I/Os for each update and for each level in layer $X$. Since there are $O(\log\frac{X}{B})$
levels in layer $X$, we charge $O(S(X)/B)$ I/Os for each update in layer $X$. Summing up all layers,
the amortized I/O cost for each update is $O\left(\frac{1}{B} \sum_{i \ge 0} S\left(B \log^{(i)} \frac{N}{B}\right)\right)$. A similar argument shows that
the amortized I/O cost for the level pulls is the same, except that the I/O cost is amortized only
on the delete signals.
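As a rough numeric sanity check of the charging argument for level pushes, the sketch below tallies the per-update charge over all levels of one layer and compares it with $S(X)/B$. It is an illustration only: we assume $\Phi_X = B \log_2 (X/B)$, levels that grow by a factor of 8 from $\Phi_X$ up to $X$, unit hidden constants, and a hypothetical per-key sorting cost `per_key_cost`; none of these choices is fixed by the analysis beyond constant factors.

```python
import math

def level_push_charge(layer_size, block, per_key_cost):
    """Per-update I/O charge for level pushes in one layer, hidden constants set to 1."""
    # Assumption: Phi_X = B * log2(X / B).
    phi = block * math.log2(layer_size / block)
    # Levels grow by a factor of 8 from Phi_X up to X: about log_8(X / Phi_X) levels.
    num_levels = max(1, math.ceil(math.log(layer_size / phi, 8)))
    # One push on level j costs about 8^j * S(X) I/Os and is paid for by about
    # 8^j * Phi_X updates, so 8^j cancels and each level contributes S(X) / Phi_X.
    per_level = per_key_cost(layer_size) / phi
    return num_levels * per_level

if __name__ == "__main__":
    B, X = 1024, 1024 * 256
    S = lambda n: math.log2(n)  # hypothetical per-key sorting cost
    print(level_push_charge(X, B, S), S(X) / B)
```

Because the number of levels is at most logarithmic in $X/B$ while each level contributes $S(X)/\Phi_X$, the total stays within $S(X)/B$ up to constants, which is the charge used in the text.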

Rebalancing the layers. Consider a layer push operation on layer $X$. It takes the keys in level
$l_X$ of layer $X$ and the adjacent level of the next layer, sorts them, and rebuilds both levels. Since both levels have
size $O(X)$, the I/O cost is $O(X S(X)/B)$. We also note that after a layer push operation, it takes at
least $\Omega(X)$ updates to level $l_X$ before it becomes unbalanced again. That means at most $O(N/X)$ layer
rebalance operations are needed. So the I/O cost for the layer rebalances of layer $X$ during the
$N/8$ updates is $O(N S(X)/B)$. We can charge $O(S(X)/B)$ I/Os for each update and each layer,
and summing up all layers, the amortized cost is $O\left(\frac{1}{B} \sum_{i \ge 0} S\left(B \log^{(i)} \frac{N}{B}\right)\right)$ I/Os for each update. A similar
argument shows that the amortized I/O cost for layer pulls is the same, except that the total I/O
cost is amortized only on the delete signals.
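For reference, the arithmetic behind the layer push charge is the following, where $c_1$ and $c_2$ are hidden constants that we introduce here only for illustration:

```latex
\[
\underbrace{c_1 \frac{N}{X}}_{\text{number of layer pushes}}
\cdot
\underbrace{c_2 \frac{X \, S(X)}{B}}_{\text{cost per push}}
\;=\; c_1 c_2 \, \frac{N \, S(X)}{B},
\qquad
\frac{c_1 c_2 \, N S(X)/B}{N/8} \;=\; 8 c_1 c_2 \cdot \frac{S(X)}{B}
\;=\; O\!\left(\frac{S(X)}{B}\right).
\]
```

The factor $X$ cancels, so the per-update charge for a layer depends on its size only through $S(X)$.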

Scheduling the operations. Note that in the schedule we need to pay some extra I/Os for
maintaining the queue $Q_0$ and the doubly linked list $L_0$. We observe that an update to $Q_0$ or $L_0$ would
trigger a flush or rebalance operation later, and the cost of a flush or rebalance operation is at least
1 I/O. So $Q_0$ and $L_0$ can be maintained without increasing the asymptotic I/O cost.